Conversation

@Andy-Jost (Contributor) commented Dec 12, 2025

Summary

This PR introduces C++ RAII-based resource handles for CUDA objects in cuda.core. The handles use std::shared_ptr with custom deleters to ensure proper destruction order and prevent resource leaks.
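
As a rough illustration of the pattern (a minimal sketch only; the type aliases and the create function below are assumptions modeled on the PR's description, not its actual code):

#include <cuda.h>
#include <memory>

// Hypothetical simplification: a StreamHandle whose deleter captures the owning
// ContextHandle, so the context outlives the stream.
using ContextHandle = std::shared_ptr<const CUcontext>;
using StreamHandle  = std::shared_ptr<const CUstream>;

StreamHandle make_stream_handle(ContextHandle h_ctx, unsigned int flags, int priority) {
    CUstream s = nullptr;
    if (cuStreamCreateWithPriority(&s, flags, priority) != CUDA_SUCCESS) {
        return {};  // empty handle signals failure
    }
    // The deleter keeps h_ctx alive until the last copy of this handle is gone.
    return StreamHandle(new CUstream(s), [h_ctx](const CUstream* p) {
        cuStreamDestroy(*p);
        delete p;
    });
}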

For a detailed explanation of the design, see cuda_core/cuda/core/_cpp/DESIGN.md. (Tip: Select ⋯ → View file to see the formatted Markdown.)

Handle Types

  • ContextHandle: Wraps CUcontext with proper retain/release semantics for primary contexts
  • StreamHandle: Wraps CUstream, holds owning reference to its context
  • EventHandle: Wraps CUevent, holds owning reference to its context
  • MemoryPoolHandle: Wraps CUmemoryPool, includes nvbug 5698116 workaround (clears peer access before destruction)
  • DevicePtrHandle: Wraps CUdeviceptr, holds owning reference to memory pool (if applicable), includes IPC pointer cache

Driver Workarounds

  • nvbug 5570902: The IPC pointer cache works around a driver bug where multiple imports of the same device pointer via IPC are not reference-counted correctly.
  • Pinned memory IPC: The cache also enables multiple imports of pinned memory buffers, working around CUDA_ERROR_ALREADY_MAPPED that the driver returns on duplicate imports.

Benefits

  • Eliminates noisy tracebacks from incorrect destruction order (see #1303 comments)
  • Enables safe resource ownership transfer between Python and C++
  • Resources remain valid independent of Python garbage collection timing

Related Issues

Test Plan

  • All existing tests pass
  • Pre-commit checks pass
  • New test for IPC duplicate import (nvbug 5570902)
  • Manual verification of destruction order fixes

copy-pr-bot (bot) commented Dec 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Andy-Jost self-assigned this Dec 12, 2025
- Create _context.pxd with Context class declaration
- Move get_primary_context from _device.pyx to _context.pyx
- Add Cython-level context functions: get_current_context, set_current_context, get_stream_context
- Update _device.pyx to use context module functions
- Update _stream.pyx to use context module functions
- Remove push_context and pop_context (unused, replaced with direct CUDA API calls)
- Reorganize _context.pyx according to style guide (principal class first)
- Replace Context._handle (object) with ContextHandle (shared_ptr) resource handle
- Add handle property to Context returning driver.CUcontext
- Update Context._from_ctx to create ContextHandle using create_context_handle()
- Update Context.__eq__ to compare actual CUcontext values (not shared_ptr addresses)
- Update Context.__hash__ to include type(self) and handle value with NULL safety
- Update _device.pyx to use ctx._handle.get()[0] for direct access
- Update _graph.pyx to use context.handle property
- Update C++ implementation to use default deleter (simplifies code)
- Rename _resource_handles_impl.cpp to _context_impl.cpp
- Remove test_dev.py development script
- Update .gitignore to allow *_impl.cpp files
- Fix all test files to use context.handle instead of context._handle
- switch context helper APIs to return ContextHandle instead of raw CUcontext
- add TLS wrapper for primary context caching using handles
- update device/stream code to consume ContextHandle-based helpers
- expose create_context_handle_ref as nogil-safe in the pxd
Introduce three helper functions for ContextHandle resource extraction:
- native(h): Returns cydriver.CUcontext for use with cydriver API calls
- py(h): Returns driver.CUcontext for use with Python driver API
- intptr(h): Returns uintptr_t for internal APIs expecting integer addresses

These helpers replace direct h_context.get()[0] calls, providing:
- Cleaner, more semantic code
- Consistent extraction pattern across all handle types
- Type-safe conversions with clear intent

Implementation details:
- native() and intptr() are inline nogil functions in .pxd
- py() requires Python module access, implemented in new _resource_handles.pyx
- Updated all call sites in _context, _device, and _stream modules
Move get_primary_context/get_current_context into C++ with thread-local caching and conditional GIL release; inline create_context_handle_ref in the header; update Cython modules and build hooks to link handle users (including _device) against resource_handles and libcuda.
The C++ implementation in _resource_handles_impl.cpp is now compiled
only into _resource_handles.so. Other modules that depend on these
symbols (_context, _device, etc.) resolve them at runtime via the
global symbol table.

This ensures a single shared instance of thread-local caches and
avoids setuptools issues with shared source files across extensions.
Move native(), intptr(), and py() from Cython inline functions to
inline C++ functions in resource_handles.hpp. This enables function
overloading when additional handle types (e.g., StreamHandle) are added.

- native(): extract raw CUDA handle from ContextHandle
- intptr(): extract handle as uintptr_t for Python interop
- py(): convert handle to Python driver wrapper object
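
For illustration, a minimal sketch of what the overloaded accessors can look like in the header (assuming ContextHandle = std::shared_ptr<const CUcontext>; the actual header in this PR may differ, and the functions were later renamed to as_cu/as_py/as_intptr):

#include <cstdint>

// Hypothetical inline accessors; one overload set is added per handle type.
inline CUcontext native(const ContextHandle& h) noexcept {
    return h ? *h : nullptr;                            // empty handle maps to a null context
}
inline std::uintptr_t intptr(const ContextHandle& h) noexcept {
    return reinterpret_cast<std::uintptr_t>(native(h));
}
// py() builds the Python driver wrapper object, so it needs the GIL and is omitted here.
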
Add StreamHandle for automatic stream lifetime management using the same
shared_ptr-based pattern established for ContextHandle.

Changes:
- Add StreamHandle type and create_stream_handle/create_stream_handle_ref
  functions in C++ with implementations in _resource_handles_impl.cpp
- Add overloaded native(), intptr(), py() helpers for StreamHandle
- Update Stream class to use _h_stream (StreamHandle) instead of raw _handle
- Owned streams are automatically destroyed when last reference is released
- Borrowed streams (from __cuda_stream__ protocol) hold _owner reference
- Update memory resource files to use native(stream._h_stream)
- Simplify Context using intptr() and py() helpers
- Move stream creation to C++ (create_stream_handle now calls
  cuStreamCreateWithPriority internally)
- Add get_legacy_stream/get_per_thread_stream for built-in streams
- Add create_stream_handle_with_owner for borrowed streams that
  prevents Python owner from being GC'd via captured PyObject*
- Add GILAcquireGuard (symmetric to GILReleaseGuard) for safely
  acquiring GIL in C++ destructors
- Simplify Stream class: remove __cinit__, _owner, _builtin,
  _legacy_default, _per_thread_default
- Use _from_handle as single initialization point for Stream
- Remove obsolete subclassing tests for removed methods
- Replace raw CUcontext _ctx_handle with ContextHandle _h_context
  for consistent handle paradigm and cleaner code
- Replace CUdevice _device_id with int using -1 sentinel
- Use intptr() helper instead of <uintptr_t>() casts throughout
- Add _from_handle(type cls, ...) factory with subclass support
- Add _legacy_default and _per_thread_default classmethods
- Eliminate duplicated initialization code in _init
- Event now uses ContextHandle for _h_context instead of raw object
- Event._init is now a cdef staticmethod accepting ContextHandle
- Context._from_ctx renamed to Context._from_handle (cdef staticmethod)
- Moved get_device_from_ctx to Stream module as Stream_ensure_ctx_device
- Inlined get_stream_context into Stream_ensure_ctx
- Simplified context push/pop logic in Stream_ensure_ctx_device

Naming standardization:
- Device._id -> Device._device_id
- _dev_id -> _device_id throughout codebase
- dev_id -> device_id for local variables
- Updated tests to use public APIs instead of internal _init methods
Device now stores its Context in _context slot, set during set_current().
This ensures Device holds an owning reference to its context, enabling
proper lifetime management when passed to Stream and Event creation.

Changes:
- Add _context to Device.__slots__
- Store Context in set_current() for both primary and explicit context paths
- Simplify context property to return stored _context
- Update create_event() to use self._context._h_context
- Remove get_current_context import (no longer needed in _device.pyx)

Add structural context dependency to owned streams

StreamBox now holds ContextHandle to ensure context outlives the stream.
This structural dependency is only for owned streams - borrowed streams
delegate context lifetime management to their owners.

C++ changes:
- StreamBox gains h_context member
- create_stream_handle(h_ctx, flags, priority) takes owning context
- create_stream_handle_ref(stream) - caller manages context
- create_stream_handle_with_owner(stream, owner) - Python owner manages context

Cython/Python changes:
- Stream._init() accepts optional ctx parameter
- Device.create_stream() passes self._context to Stream._init()
- Owned streams get context handle embedded in C++ handle
Event now uses EventHandle (shared_ptr) for RAII-based lifetime management,
following the same pattern as Stream.

C++ changes:
- Add EventHandle type alias and EventBox struct
- Add create_event_handle(h_ctx, flags) with context captured in deleter
- Add create_event_handle_ipc(ipc_handle) for IPC events (no context dep)
- Add native(), intptr(), py() overloads for EventHandle

Cython changes:
- Event._h_event replaces raw CUevent _handle
- _init() uses create_event_handle()
- from_ipc_descriptor() uses create_event_handle_ipc()
- close() uses _h_event.reset()
- Keep _h_context for cached fast access
- Simplified branch structure: early return for Event, single path for Stream
- Use native() helper for handle access instead of casting via handle property
- Temporary events now use EventHandle with RAII cleanup (no explicit cuEventDestroy)
- Added create_event_handle import
- New overload takes only flags (no ContextHandle) for temporary events
- Delegates to existing overload with empty ContextHandle
- Updated _stream.pyx and _memoryview.pyx to use simpler overload
- Removed unnecessary get_current_context import from _memoryview.pyx
- Removed unnecessary Stream_ensure_ctx call from Stream.wait()
C++ layer (resource_handles.hpp/cpp):
- Add MemoryPoolHandle = std::shared_ptr<const CUmemoryPool>
- Add create_mempool_handle(props) - owning, calls cuMemPoolDestroy on release
- Add create_mempool_handle_ref(pool) - non-owning reference
- Add create_mempool_handle_ipc(fd, handle_type) - owning from IPC import
- Add get_device_mempool(device_id) - get current pool for device (non-owning)
- Add native(), intptr(), py() overloads for MemoryPoolHandle

Cython layer:
- Update _resource_handles.pxd with new types and functions
- Update _device_memory_resource.pxd: replace raw handle with MemoryPoolHandle
- Reorder members: _h_pool first (matches Stream/Event pattern)
- Update _device_memory_resource.pyx to use new handle functions
- Update _ipc.pyx to use create_mempool_handle_ipc for IPC imports
- DMR_close now uses RAII (_h_pool.reset()) instead of explicit cuMemPoolDestroy
- Consistent member initialization order across __cinit__, init functions, and close
Introduce DevicePtrHandle (std::shared_ptr<const CUdeviceptr>) to manage
device pointer lifetimes with automatic deallocation. Key features:

- Allocation functions: deviceptr_alloc_from_pool, deviceptr_alloc_async,
  deviceptr_alloc, deviceptr_alloc_host, deviceptr_create_ref
- IPC import via deviceptr_import_ipc with error output parameter
- Deallocation stream stored in mutable DevicePtrBox, accessible via
  deallocation_stream() and set_deallocation_stream()
- cuMemFreeAsync used for deallocation (NULL stream = legacy default)
- Buffer class updated to use DevicePtrHandle instead of raw pointers
- Buffer.handle returns integer for backward compatibility with ctypes
- IPCBufferDescriptor.payload_ptr() helper to simplify casting

Note: IPC-imported pointers do not yet implement reference counting
workaround for nvbug 5570902.
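
A minimal sketch of the idea behind these allocation handles (an assumed simplification: the real code stores the deallocation stream in a mutable DevicePtrBox so it can be changed later, while this sketch captures it once in the deleter):

#include <cuda.h>
#include <cstddef>
#include <memory>

using DevicePtrHandle = std::shared_ptr<const CUdeviceptr>;

// Hypothetical: allocate from a pool on a stream, free asynchronously on release.
DevicePtrHandle deviceptr_alloc_from_pool_sketch(std::size_t size, CUmemoryPool pool, CUstream stream) {
    CUdeviceptr ptr = 0;
    if (cuMemAllocFromPoolAsync(&ptr, size, pool, stream) != CUDA_SUCCESS) {
        return {};
    }
    return DevicePtrHandle(new CUdeviceptr(ptr), [stream](const CUdeviceptr* p) {
        cuMemFreeAsync(*p, stream);  // a NULL stream means the legacy default stream
        delete p;
    });
}
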
Change all intptr() overloads to return std::intptr_t (signed) instead
of std::uintptr_t per C standard convention for pointer-to-integer
conversion. This addresses issue NVIDIA#1342 which requires Buffer.handle
to return a signed integer.

Fixes NVIDIA#1342
Implement a systematic error handling approach for C++ resource handle
functions using thread-local storage, similar to cudaGetLastError().

API:
- get_last_error(): Returns and clears the last CUDA error
- peek_last_error(): Returns without clearing
- clear_last_error(): Explicitly clears the error

All functions that can fail now set the thread-local error before
returning an empty handle. This allows callers to retrieve specific
CUDA error codes for proper exception propagation.

Updated deviceptr_import_ipc to use this pattern instead of an
output parameter.
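
Roughly, caller-side usage looks like this (a sketch; create_stream_handle and its arguments follow the signatures mentioned elsewhere in this PR, and the exception translation is left to the Cython layer):

// Hypothetical caller: an empty handle signals failure; fetch the precise error.
void example_create(ContextHandle h_ctx) {
    StreamHandle h = create_stream_handle(h_ctx, CU_STREAM_NON_BLOCKING, /*priority=*/0);
    if (!h) {
        CUresult err = get_last_error();  // returns and clears the thread-local status
        // ... raise the corresponding Python exception from err ...
    }
}
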
IPC-imported device pointers are not correctly reference counted by the
driver - the first cuMemFreeAsync incorrectly unmaps the memory even when
the pointer was imported multiple times.

Work around this by caching imported pointers and returning the same
handle for duplicate imports. The cache uses weak_ptr so entries are
automatically cleaned up when all references are released.

The workaround can be easily bypassed via use_ipc_ptr_cache() when a
driver fix becomes available.
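
Conceptually, the cache is something like the following (a minimal sketch with assumed names; it keys on the imported pointer value and holds weak_ptr entries so the cache never extends a pointer's lifetime):

#include <cuda.h>
#include <memory>
#include <mutex>
#include <unordered_map>

using DevicePtrHandle = std::shared_ptr<const CUdeviceptr>;

std::mutex ipc_cache_mutex;
std::unordered_map<CUdeviceptr, std::weak_ptr<const CUdeviceptr>> ipc_cache;

// Hypothetical import path: duplicate imports hand back the one live handle.
DevicePtrHandle import_ipc_cached(CUmemoryPool pool, CUmemPoolPtrExportData* data, CUstream stream) {
    CUdeviceptr ptr = 0;
    if (cuMemPoolImportPointer(&ptr, pool, data) != CUDA_SUCCESS)
        return {};
    std::lock_guard<std::mutex> lock(ipc_cache_mutex);
    if (auto existing = ipc_cache[ptr].lock())
        return existing;                     // reuse: only one owning handle per pointer
    DevicePtrHandle owned(new CUdeviceptr(ptr), [stream](const CUdeviceptr* p) {
        cuMemFreeAsync(*p, stream);          // single tracked release frees the import
        delete p;
    });
    ipc_cache[ptr] = owned;                  // weak_ptr entry expires with the last reference
    // (a real cache can also prune expired entries; omitted here)
    return owned;
}
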
Implements handle-based owner tracking for device pointers, consistent
with the pattern used for streams (create_stream_handle_with_owner).

Changes:
- Add deviceptr_create_with_owner() - creates non-owning handle that
  keeps a Python owner alive via Py_INCREF/DECREF (lambda capture)
- If owner is nullptr, delegates to deviceptr_create_ref
- Buffer._owner field tracks owner in Python for property access
- Buffer._init() simplified to always call deviceptr_create_with_owner
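
A minimal sketch of the keep-alive idea (an assumed simplification; the GIL handling around the Py_DECREF, which the real code needs, is reduced to a comment here):

#include <Python.h>
#include <cuda.h>
#include <memory>

using DevicePtrHandle = std::shared_ptr<const CUdeviceptr>;

// Hypothetical: non-owning handle that keeps a Python owner alive for its lifetime.
DevicePtrHandle deviceptr_create_with_owner_sketch(CUdeviceptr ptr, PyObject* owner) {
    if (owner == nullptr) {
        return DevicePtrHandle(new CUdeviceptr(ptr));  // plain non-owning reference
    }
    Py_INCREF(owner);
    return DevicePtrHandle(new CUdeviceptr(ptr), [owner](const CUdeviceptr* p) {
        delete p;
        Py_DECREF(owner);  // the real deleter must hold the GIL here (GILAcquireGuard)
    });
}
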
leofang added this to the cuda.core beta 11 milestone Dec 15, 2025
@Andy-Jost (Contributor Author):
/ok to test b629ec6

Expose a full C++ handles function table via PyCapsule so extensions can dispatch without RTLD_GLOBAL, and switch resource_handles.cpp to load libcuda symbols at runtime to support CPU-only imports.
Use a lazy PyCapsule in _resource_handles to resolve and cache required CUDA driver entrypoints via cuda.bindings.driver.cuGetProcAddress, and have resource_handles.cpp consume that table on first use. This avoids duplicating driver pathfinding logic and removes dlopen/dlsym linkage requirements.
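
For context, the capsule-based dispatch works roughly like this (a generic sketch of the technique; the struct layout and the dotted capsule name below are assumptions, not the PR's actual definitions):

#include <Python.h>

// Hypothetical function table exported by the module that compiles the C++ code.
// The exporting module publishes it with PyCapsule_New(&table, <name>, nullptr).
struct HandlesApi {
    int (*example_entry)(void);
    // ... one pointer per exported handle function ...
};

const HandlesApi* api_table = nullptr;  // consumers resolve this before first use

int init_handles_table() {
    api_table = static_cast<const HandlesApi*>(
        PyCapsule_Import("cuda.core.experimental._resource_handles._CXX_API", 0));
    return api_table != nullptr ? 0 : -1;  // on failure a Python error is already set
}
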
@rwgk (Collaborator) commented Jan 7, 2026

The latest update:

  1. Refactors py methods to reduce duplicate code
  2. Renames native to cu

On the first point, I was able to get a thread-safe solution without explicit use of atomics. For instance:

static PyObject* cls = detail::py_class_for("CUcontext");

C++ ensures this initialization will occur at most once.

AFAICT this is probably fine for free-threaded Python, but it has the exact same latent deadlock risk in multi-threaded scenarios that we had to solve with pybind/pybind11#4877.

I'm fine if you don't want to add the robust code in this PR, but then we should create an issue to track the latent deadlock risk. We can work on that as a follow-on.


More background:

The latent deadlock risk is explained in detail here: https://github.com/pybind/pybind11/blob/master/docs/advanced/deadlock.md

The deadlock scenario:

  1. Thread A holds GIL, enters static init, acquires C++ static init lock
  2. Some Python C API function releases the GIL during execution
  3. Thread B acquires GIL, enters same function, blocks on C++ static init lock
  4. Thread A tries to reacquire GIL → deadlock

It will most likely work most of the time in practice: the module is usually already imported, and the attribute access typically doesn't release the GIL. But based on my pybind11 experience, I'm convinced we're setting ourselves up for flakiness with this code. The pybind11 pattern "worked" for years until numpy 2 came around, and then suddenly became very flaky for reasons I still don't fully understand.

It seems easily avoidable. LLMs can generate a solution that avoids the latent deadlock risk in seconds (e.g., release GIL → std::call_once → reacquire GIL inside), and those few seconds plus ~20 lines of extra code are likely to save hours of dealing with flakiness in the future.
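
For concreteness, the ordering rwgk describes might look like this (a sketch, assuming the caller holds the GIL and the one-time initializer may call back into Python):

#include <Python.h>
#include <mutex>

std::once_flag init_once;

void ensure_initialized() {
    // Release the GIL first so the lock order is always: call_once guard, then GIL.
    PyThreadState* tstate = PyEval_SaveThread();
    std::call_once(init_once, [] {
        PyGILState_STATE g = PyGILState_Ensure();  // reacquire before touching Python
        // ... one-time initialization that may use the Python C API ...
        PyGILState_Release(g);
    });
    PyEval_RestoreThread(tstate);
}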

@rwgk (Collaborator) commented Jan 7, 2026

Oh, sorry I missed this comment yesterday:

An alternative might be "Holder," as in StreamHolder, which also preserves the convenient h_stream naming pattern.

I really like Holder; based on what @cpcloud said, it's better than Rc.

I lean toward cu, py, and intptr over the as_* variants since these accessors appear often in nested expressions. In this case, I think the readability benefit of having shorter lines and fewer line wraps outweighs the loss of descriptiveness since handles and their accessors are used so widely that most developers will quickly become familiar with them.

I want to throw in a couple extra arguments:

  • I have overlooked such short names completely many more times than I'd like to admit. "Too short to even be read"
  • It's much more difficult to pin-point cu or py in a large source code file than as_cu or as_py.

// we avoid calling PyEval_SaveThread (which requires holding the GIL).
// It also handles the case where Python is finalizing and GIL operations
// are no longer safe.
class GILReleaseGuard {
Review comment (Collaborator):

Minor suggestion: Add a Note documenting the behavior when the GIL isn't actually released:

// Note: If the GIL is not released (finalizing, or not held):
// - Reduces parallelism (other Python threads remain blocked)
// - No deadlock risk as long as the guarded code doesn't call back into Python
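
For readers unfamiliar with the guard, a minimal sketch of its shape (hypothetical; the actual guard in this PR also covers the finalization case mentioned above):

#include <Python.h>

// Hypothetical RAII guard: releases the GIL only if this thread currently holds it.
class GILReleaseGuardSketch {
public:
    GILReleaseGuardSketch() {
        if (PyGILState_Check()) {
            tstate_ = PyEval_SaveThread();
        }
    }
    ~GILReleaseGuardSketch() {
        if (tstate_ != nullptr) {
            PyEval_RestoreThread(tstate_);
        }
    }
private:
    PyThreadState* tstate_ = nullptr;
};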

@rwgk (Collaborator) left a review comment:

Some additional pretty minor suggestions. Technically this PR looks great to me. My main concern is intuitive naming: I believe Holder is far better than Handle, for the reasons explained under the heading '"Handle" implies' in my comment from a couple of days ago.

// Stream Handles
// ============================================================================

struct StreamBox {
Review comment (Collaborator):

The *Box structs (ContextBox, StreamBox, EventBox, MemoryPoolBox, DevicePtrBox) seem to be pure implementation details used only within this .cpp file. I'd move them into the existing anonymous namespace to provide stronger encapsulation.

};

static DevicePtrBox* get_box(const DevicePtrHandle& h) {
    const CUdeviceptr* p = h.get();
Review comment (Collaborator):

Suggested comment (I'd put it inside the function):

// Recovers the owning DevicePtrBox from the aliased CUdeviceptr pointer.
// This works because DevicePtrHandle is a shared_ptr alias pointing to
// &box->resource, so we can compute the containing struct using offsetof.
// The const_cast is safe because we only use this to access the mutable
// h_stream member or in the deleter (where the box is being destroyed).
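
To make the recovery concrete, a sketch of the function being described (assuming resource is the first member of DevicePtrBox and the handle aliases &box->resource):

#include <cstddef>

// Hypothetical sketch of pointer recovery via the shared_ptr aliasing constructor.
static DevicePtrBox* get_box_sketch(const DevicePtrHandle& h) {
    const CUdeviceptr* p = h.get();  // points at box->resource
    const char* bytes = reinterpret_cast<const char*>(p) - offsetof(DevicePtrBox, resource);
    return const_cast<DevicePtrBox*>(reinterpret_cast<const DevicePtrBox*>(bytes));
}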


struct DevicePtrBox {
    CUdeviceptr resource;
    mutable StreamHandle h_stream;
Review comment (Collaborator):

Suggested comment for h_stream:

    // Mutable to allow set_deallocation_stream() to update the stream
    // through a const DevicePtrHandle. The stream can be changed after
    // allocation (e.g., to synchronize deallocation with a different stream).

Comment on lines 24 to 25
static std::once_flag driver_load_once;
static bool driver_loaded = false;
Review comment (Collaborator):

I believe static is redundant inside an anonymous namespace. Harmless but I'd remove it.

I see there are also static declarations scattered elsewhere in the file, for example, static DevicePtrBox* get_box(...) and static void clear_mempool_peer_access(...) in namespace cuda_core. These helper functions (and the *Box structs they work with) are implementation details that don't need external visibility. Moving them all into the anonymous namespace would:

  1. Group all implementation details in one place
  2. Eliminate the need for scattered static declarations
  3. Make the public API surface of namespace cuda_core clearer

Reply (Contributor Author):

I didn't go so far as to group everything into anonymous namespaces, but I cleaned up and grouped shared things as much as possible. I kept the structure where each resource type has a section with its Box declaration, helper classes and functions, and public API functions.

cu,
)

_init_handles_table()
Review comment (Collaborator):

I think it'd be good to consistently highlight these calls (also in other files) with a comment, so they stand out to readers glancing through, and to refer back to the documentation, e.g.:

# prerequisite before calling handle API functions (see _cpp/Design.md)
_init_handles_table()
<empty line here>

- Refactor py() overloads to use detail::make_py() helper, eliminating
  code duplication
- Remove static caching of Python class objects in py() to avoid
  deadlock between static initialization guard and GIL
- Fix ensure_driver_loaded() by releasing GIL before call_once to
  ensure consistent lock order (guard -> GIL)
- Add "Static Initialization and Deadlock Hazards" section to DESIGN.md
  with link to pybind11 documentation
@Andy-Jost (Contributor Author):

AFAICT this is probably fine for free-threaded Python, but it has the exact same latent deadlock risk in multi-threaded scenarios that we had to solve with pybind/pybind11#4877.

Thanks for pointing this out. This is a pretty subtle issue that could cause a headache later. I dealt with this in the latest upload as follows:

  • Removed the static class lookup in py functions. The deadlock risk is real and I don't think the extra complexity is justified for these functions. They are just nice-to-haves, not critical to performance.
  • I found an additional instance of this problem related to load_driver_api and updated it to release the GIL before entering std::call_once.
  • I updated the DESIGN.md document to point out the deadlock hazard.

As for the naming, let's put it up to a vote. I have a slight preference for the existing names, but I don't mind changing the code if others prefer the new names. I mainly want to avoid changing multiple times. There are two proposed name changes:

  1. cu/py/intptr -> as_cu/as_py/as_intptr
  2. StreamHandle -> StreamHolder etc.

The second one also involves renaming files, documentation, etc. If anyone has a preference either way, please mention it.

I will get to work on the remaining comments now.

- Add note to GILReleaseGuard about behavior when GIL isn't released
- Add comment for mutable h_stream in DevicePtrBox
- Add comment to get_box() explaining pointer recovery technique
- Add _init_handles_table() comments to all .pyx files referencing DESIGN.md
- Remove redundant static from items inside anonymous namespace
- Wrap *Box structs in anonymous namespace for encapsulation
@Andy-Jost (Contributor Author):

/ok to test 61326e9

@rwgk (Collaborator) commented Jan 7, 2026

Logging this here FYI / for visibility.


Question

Do you see any danger of reference cycles between the new family of shared_ptr types (StreamHandle, ContextHandle, etc.)?

Analysis

The dependency graph among handle types is:

ContextHandle (no handle dependencies)
    ↑
    ├── StreamHandle (deleter captures ContextHandle)
    │       ↑
    │       └── DevicePtrHandle (box contains StreamHandle, deleter may capture MemoryPoolHandle)
    │
    └── EventHandle (deleter captures ContextHandle)

MemoryPoolHandle (no handle dependencies)

No cycles are possible among the C++ shared_ptr types. The dependencies form a DAG — all arrows point "upward" toward ContextHandle or are one-directional.

Key observations:

  • StreamHandle → ContextHandle, but ContextHandle never references StreamHandle
  • DevicePtrHandle → StreamHandle + MemoryPoolHandle, but neither of those references DevicePtrHandle
  • StreamBox and EventBox only contain the raw CUDA handle, no other *Handle types

One Potential Concern: Python ↔ C++ Cycles

The _with_owner variants (create_stream_handle_with_owner, deviceptr_create_with_owner) capture a PyObject* via Py_INCREF. If user code creates a situation where:

  1. Python object A holds a handle
  2. That handle's deleter holds a ref to Python object B
  3. B holds a reference to A

...this would be a cycle that Python's GC cannot break (because the C++ Py_INCREF is invisible to the GC's cycle detector).

However, this is a general issue with C extensions holding Python references, not specific to this design. The _with_owner pattern is intended for borrowing handles from foreign objects, where you wouldn't typically store the resulting handle back on the owner.

Conclusion

The C++ handle graph is cycle-free by design. Python-level cycles involving _with_owner are theoretically possible but unlikely in practice.

@leofang (Member) left a review comment:

Sorry for jumping in late, would it be possible to give me a bit more time to review?

@Andy-Jost (Contributor Author):

No cycles are possible among the C++ shared_ptr types.

Yes, the natural ownership relation implied by the driver should preserve this forever. The driver would need to introduce resources that mutually own one another to break this.

# SPDX-License-Identifier: Apache-2.0

recursive-include cuda/core *.pyx *.pxd
recursive-include cuda/core *.pyx *.pxd *.cpp *.hpp
Review comment (Member):

Q: Would this include Cython-generated code? If so we don't want to unconditionally include .cpp files.

cdef:
    cydriver.CUevent _handle
    EventHandle _h_event
    ContextHandle _h_context  # Cached for fast access
Review comment (Member):

nit: the comment isn't quite accurate; the point of caching the context handle is not faster access, but that there is no driver API to look it up (unlike cuStreamGetCtx)

Review comment (Member):

Thanks for writing up the markdown file, Andy. But the added complexity, especially around the new Cython bindings (to the resource handle C++ code), and the GIL handling, warrants at least another discussion, I think. This complexity was not mentioned in the original design docs (or maybe I missed it?), and I am not sure it is the best way to go. I'll schedule a meeting to follow up.

Reply (Contributor Author):

As discussed in the meeting, I'll follow up to see whether the capsules can be eliminated. See #1450 and #1452

// ============================================================================

// cu() - extract the raw CUDA handle
inline CUcontext cu(const ContextHandle& h) noexcept {
Review comment (Member):

would be better to avoid two-letter names like cu and py

Reply (Contributor Author):

As a general rule, yes, short names should be avoided. In this case, cu() and py() are ubiquitous operations used throughout the codebase, and longer names would arguably harm readability.

These functions are designed to be nested in expressions, so brevity matters.

For example:

HANDLE_RETURN(cydriver.cuMemcpyAsync(
   cu(dst._h_ptr), cu(self._h_ptr), src_size, cu(s._h_stream)))

With a longer name like get_cuda_handle():

HANDLE_RETURN(cydriver.cuMemcpyAsync(
    get_cuda_handle(dst._h_ptr), get_cuda_handle(self._h_ptr),
    src_size, get_cuda_handle(s._h_stream)))

The longer form requires more lines, more visual parsing, and potentially more temporary variables. Given how frequently these appear in driver call expressions, I believe the short names reduce cognitive load rather than increase it.

Reply (Collaborator):

cu is too short and get_cuda_handle is too long:

  • too short: it's a pain to pin-point such names while reading the code. Searching for cu will have many false positives. cu( will be better, but still. I think even LLMs will have trouble pin-pointing all calls with a short name.

  • too long: cognitive load

The sweet spot is somewhere in the middle.

Reply (Contributor Author):

If we're trying to push this PR through today, I suggest we table this question and follow up, if needed.

Comment on lines +287 to +306
// ============================================================================
// Thread-local error handling
// ============================================================================

// Thread-local status of the most recent CUDA API call in this module.
static thread_local CUresult err = CUDA_SUCCESS;

CUresult get_last_error() noexcept {
    CUresult e = err;
    err = CUDA_SUCCESS;
    return e;
}

CUresult peek_last_error() noexcept {
    return err;
}

void clear_last_error() noexcept {
    err = CUDA_SUCCESS;
}
Review comment (Member):

The cudart-style error handling seems a lot harder to use than the previous HANDLE_RETURN... What is the motivation for introducing a thread-local state?

Reply (Contributor Author):

I think this is another question we should follow up on, if needed.

@Andy-Jost (Contributor Author):

Related issue for future simplification: #1450 (investigate using __pyx_capi__ to potentially eliminate the PyCapsule mechanisms)

@Andy-Jost (Contributor Author):

Related issue: #1452 (investigate eliminating _CXX_API capsule by having Cython modules call directly into resource handles code)

- MANIFEST.in: Be specific about C++ files to avoid Cython-generated .cpp
- _event.pxd: Remove inaccurate comment about caching for fast access
- Remove unnecessary experimental/_context.pxd stub
@Andy-Jost (Contributor Author):

/ok to test b3c6c64

@Andy-Jost (Contributor Author):

That latest upload fixes issues from Leo's comments. Remaining open questions:

If we can resolve or defer these, then this PR is ready for merge AFAIK.

@rwgk (Collaborator) commented Jan 9, 2026

That latest upload fixes issues from Leo's comments. Remaining open questions:

I think the question of the two-letter names cu and py should be settled in this PR:

  • Super easy change now
  • But once it's "out there" I wouldn't want to touch it anymore.

My preference would still be to use as_cu, as_py, and for symmetry, as_intptr.

Rename cu() -> as_cu(), py() -> as_py(), intptr() -> as_intptr()
to avoid overly short two-letter function names per review feedback.
@Andy-Jost (Contributor Author):

/ok to test 10623ee

@Andy-Jost (Contributor Author):

The latest upload mass-renames py, cu, and intptr to their as_-prefixed forms.

@leofang (Member) left a review comment:

Thanks, Andy/Ralf!

leofang merged commit acc78f7 into NVIDIA:main Jan 10, 2026
80 checks passed
@github-actions (bot)

Doc Preview CI
Preview removed because the pull request was closed or merged.

Labels

cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), P0 (High priority - Must do!)

Development

Successfully merging this pull request may close these issues.

  • Buffer.handle should be intptr_t
  • Implement a handle-based resource management layer

5 participants